{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Estimate data load of dictionary expansion method\n",
    "\n",
    "This notebook takes the Dolinsky-Huber-Horne (DHH) social group keywords dictionary and estimates how many terms would require manual coding if one applied word-embedding based dictionary expansion to discover new keywords not yet in the dictionary.\n",
    "\n",
    "Word-embedding based dictionary expansion is a method for finding likely relavant keywords based on a set of \"seed\" keywords.\n",
    "For each dictionary keyword, one uses the embedding model to find the $k$ most similar words (according to cosine similarity) to the keyword.\n",
    "These $k$ words are candidates for addition to the dictionary. \n",
    "\n",
    "Importantly, while automating the search for keyword *candidate*, its a best practice to manually review candidate terms to decide whether or not a given term should be included in the dictionary.\n",
    "Thus, word-embedding based dictionary expansion requires (expert) coding (a.k.a. supervision).\n",
    "\n",
    "This need for a human in the loop implies that this method is not free from manual labor.\n",
    "The amount of manual labor required to implement it, in turn, increases in (i) $n$, the number of seed keywords in the dictionary, and $k$, the number candidate terms that should be considered for each seed keyword.\n",
    "\n",
    "In our analysis, $n$ is fixed because we start with the DHH dictionary.\n",
    "Hence we vary $k$ between 10 and 100 to compute the number of words that required manual coding in the dictionary expansion process.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from types import SimpleNamespace\n",
    "\n",
    "args = SimpleNamespace()\n",
    "\n",
    "args.experiment_name = 'dictionary_expansion'\n",
    "args.experiment_results_path = '../../../results/validation/dhh_dictionary'\n",
    "\n",
    "args.keywords_file = '../../../data/exdata/dhh_dictionary/keywords.csv'\n",
    "args.gensim_embedding_model = 'word2vec-google-news-300'\n",
    "args.nearest_neighbors_values = '10,25,50,100'\n",
    "\n",
    "args.output_folder = '../../../data/validation/dhh_dictionary/dictionary_expansion'\n",
    "args.overwrite_output = False\n",
    "\n",
    "# parse the arguments\n",
    "\n",
    "args.nearest_neighbors_values = [int(x.strip()) for x in args.nearest_neighbors_values.split(',')]\n",
    "args.nearest_neighbors_values = sorted(args.nearest_neighbors_values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "\n",
    "import pandas as pd\n",
    "import re\n",
    "\n",
    "import gensim.downloader as api\n",
    "\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "from typing import List, Tuple"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load dictionary and prepare the keywords\n",
    "\n",
    "Keywords in the DHH dictionaries use glob patterns. \n",
    "We need to convert them to regex.\n",
    "\n",
    "Moreover, any words in the embedding model's vocabulary that are n-grams with n ≥ 2 do not contain under scores as token delimiter (not whitespaces).\n",
    "Hence, we need to pre-process the dictionary keywords accordingly to allow mapping them to the word vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the dictionary\n",
    "keywords_wide = pd.read_csv(args.keywords_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert the glob keywords into regex and make them compatible with encoding style of word2vec vocabulary (e.g., whitespaces are _ in n-grams)\n",
    "def glob_to_regex(glob):\n",
    "    \"\"\"\n",
    "    Convert a glob pattern to a regular expression.\n",
    "    \"\"\"\n",
    "    # Escape all characters that are special in regex, except for *, ?, and []\n",
    "    regex = re.escape(glob)\n",
    "    # make compatible with word2vec encoding\n",
    "    regex = regex.replace('\\\\ ', '_')\n",
    "    \n",
    "    # Replace the escaped glob wildcards with regex equivalents (given encoding of word2vec vocabulary)\n",
    "    regex = regex.replace(r'\\*', '[^_]*?')\n",
    "    regex = regex.replace(r'\\?', '.')\n",
    "    regex = regex.replace(r'\\[', '[')\n",
    "    regex = regex.replace(r'\\]', ']')\n",
    "    \n",
    "    # Add anchors to match the entire string\n",
    "    regex = r'^' + regex + '$'\n",
    "\n",
    "    \n",
    "    return regex\n",
    "\n",
    "keywords = {c: [glob_to_regex(v.strip()) for v in vals if not pd.isna(v)] for c, vals in keywords_wide.T.iterrows()}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## load a pre-trained Word2Vec embedding model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = api.load(args.gensim_embedding_model)  # This will (down)load the Google News pre-trained Word2Vec model\n",
    "# get the model vocabulary\n",
    "model_vocab = list(model.key_to_index.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find dictionary keywords' word vectors\n",
    "\n",
    "To look up similar terms for a dictionar keyword in the embedding model, we need to find each dictionary keyword's word vectors.\n",
    "This requires checking whether the keyword is in the embedding model's vocabulary.\n",
    "\n",
    "One problem of applying the dictionary expansion method to dictionaries that contain multi-word expressions is that the concreate $n$-grams recorded in the dictionary are not in the embedding model's vocabulary.\n",
    "\n",
    "Below, we address this issue by recursively chunking a keyword that cannot be found in the embedding model's dictionary into its constituent $n$-grams and checking whether they can be found in the vocabulary. We repear this recursive chunking up to three times.\n",
    "\n",
    "**_Example 1_**:\n",
    "\n",
    "The bi-gram keyword \"business_people\" is not in the vocabulary.\n",
    "We chunk it into its uni-gram words (splitting at the token separator '_') and check whether each is in the vocabulary.\n",
    "\n",
    "**_Example 2_**:\n",
    "\n",
    "The 7-gram keyword \"people_living_and_working_in_rural_areas\" is not in the vocabulary.\n",
    "We first chunk it into its constituent 6-grams (\"people_living_and_working_in_rural\" and \"living_and_working_in_rural_areas\") and check whether they are in the vocabulary. None is.\n",
    "So we chunk each into its constituent 5-grams and check whether these are in the vocabulary.\n",
    "Etc.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# helper functions\n",
    "def find_in_vocabulary(term: str, model_vocab: List[str]) -> List[Tuple[int, str]]:\n",
    "    return [(i, w) for i, w in enumerate(model_vocab) if re.search(term, w)]\n",
    "# # Example\n",
    "# find_in_vocabulary('^worker[^_]*?$', model_vocab)\n",
    "\n",
    "def split(x: str, pattern):\n",
    "    return re.split(pattern, x)\n",
    "\n",
    "def segment_into_ngrams(x: str, n: int=1, pattern: str='(?<!^)_(?!\\])', sep: str='_') -> List[str]:\n",
    "    x = split(x, pattern)\n",
    "    if len(x) < n:\n",
    "        return [x]\n",
    "    elif n > len(x):\n",
    "        return [sep.join(x)]\n",
    "    else:\n",
    "        out = ['^' + sep.join(x[i:i+n]) + '$' for i in range(0, len(x)-n+1)]\n",
    "        out[0] = out[0].replace('^^', '^')\n",
    "        out[-1] = out[-1].replace('$$', '$')\n",
    "        out = [s for s in out if s != '^[^_]*?$']\n",
    "        return out\n",
    "# # Examples\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=1))\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=2))\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=3))\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=4))\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=5))\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=6))\n",
    "# print(segment_into_ngrams(\"people_living_and_working_in_rural_areas\", n=7))\n",
    "\n",
    "def find_keyword_in_vocabulary(\n",
    "        kw: str, \n",
    "        model_vocab: List[str],\n",
    "        max_splits: int=3\n",
    "    ) -> List[Tuple[str, str, int, int, str]]:\n",
    "    \"\"\"Find dictionary keyword in word2vec model vocabulary.\n",
    "\n",
    "    Args:\n",
    "        kw (str): dictionary keyword (regex pattern)\n",
    "        model_vocab (List[str]): word2vec model vocabulary\n",
    "\n",
    "    Returns:\n",
    "        List[Tuple[str, str, int, int, str]]: list of tuples with keyword, keyword segment, times splitted, index of matched term in vocabulary, and vocabulary word\n",
    "    \"\"\"\n",
    "    vocab_terms = find_in_vocabulary(kw, model_vocab)\n",
    "    n = len(split(kw, pattern='(?<!^)_(?!\\])'))\n",
    "    if len(vocab_terms) > 0:\n",
    "        return [(kw, None, 0, t[0], t[1]) for t in vocab_terms]\n",
    "    terms = [kw]\n",
    "    vocab_terms = []\n",
    "    while len(terms) > 0:\n",
    "        term = terms.pop(0)\n",
    "        n_ = len(split(term, pattern='(?<!^)_(?!\\])'))-1\n",
    "        if n-n_ > max_splits:\n",
    "            break\n",
    "        if n_ == 0:\n",
    "            continue\n",
    "        segs = segment_into_ngrams(term, n=n_)\n",
    "        for s in segs:\n",
    "            tmp = find_in_vocabulary(s, model_vocab)\n",
    "            if len(tmp) > 0:\n",
    "                tmp = [(kw, s, n-n_, t[0], t[1]) for t in tmp]\n",
    "                vocab_terms.extend(tmp)\n",
    "            else:\n",
    "                terms.append(s)\n",
    "    return vocab_terms\n",
    "# # examples\n",
    "# print(find_keyword_in_vocabulary('^employee[^_]*?$', model_vocab))\n",
    "# print(find_keyword_in_vocabulary('^employer[^_]*?$', model_vocab))\n",
    "# print(find_keyword_in_vocabulary('^business_people$', model_vocab))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a8b7b1a992744c5e90c2039f0346204d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "searching matches for dictionary keywords in model vocabulary:   0%|          | 0/43 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# get matches for words in the dictionary\n",
    "# note: this can take 20-25 minutes\n",
    "matches = []\n",
    "for cat, kws in tqdm(keywords.items(), desc='searching matches for dictionary keywords in model vocabulary'):\n",
    "    for kw in kws:\n",
    "        matched = find_keyword_in_vocabulary(kw, model_vocab)\n",
    "        matches.extend([(cat, ) + m for m in matched])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>keyword</th>\n",
       "      <th>keyword_segment</th>\n",
       "      <th>keyword_splitted_n_times</th>\n",
       "      <th>match_vocabulary_index</th>\n",
       "      <th>match_vocabulary_word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>var1</td>\n",
       "      <td>^employee[^_]*?$</td>\n",
       "      <td>None</td>\n",
       "      <td>0</td>\n",
       "      <td>725</td>\n",
       "      <td>employees</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>var1</td>\n",
       "      <td>^employee[^_]*?$</td>\n",
       "      <td>None</td>\n",
       "      <td>0</td>\n",
       "      <td>2019</td>\n",
       "      <td>employee</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>var1</td>\n",
       "      <td>^employee[^_]*?$</td>\n",
       "      <td>None</td>\n",
       "      <td>0</td>\n",
       "      <td>758697</td>\n",
       "      <td>employeed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>var1</td>\n",
       "      <td>^employee[^_]*?$</td>\n",
       "      <td>None</td>\n",
       "      <td>0</td>\n",
       "      <td>1312688</td>\n",
       "      <td>employees.The</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>var1</td>\n",
       "      <td>^employee[^_]*?$</td>\n",
       "      <td>None</td>\n",
       "      <td>0</td>\n",
       "      <td>1511655</td>\n",
       "      <td>employeers</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  category           keyword keyword_segment  keyword_splitted_n_times  \\\n",
       "0     var1  ^employee[^_]*?$            None                         0   \n",
       "1     var1  ^employee[^_]*?$            None                         0   \n",
       "2     var1  ^employee[^_]*?$            None                         0   \n",
       "3     var1  ^employee[^_]*?$            None                         0   \n",
       "4     var1  ^employee[^_]*?$            None                         0   \n",
       "\n",
       "   match_vocabulary_index match_vocabulary_word  \n",
       "0                     725             employees  \n",
       "1                    2019              employee  \n",
       "2                  758697             employeed  \n",
       "3                 1312688         employees.The  \n",
       "4                 1511655            employeers  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# convert the result of this process to a data frame\n",
    "cols = ['category', 'keyword', 'keyword_segment', 'keyword_splitted_n_times', 'match_vocabulary_index', 'match_vocabulary_word']\n",
    "matched_df = pd.DataFrame(matches, columns=cols)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data frame records the following information:\n",
    "\n",
    "- `category`: the dictionary category of the keyword\n",
    "- `keyword`: the keyword (in its pre-processed form)\n",
    "- `keyword_segment`: the segment of the keyword that was found in the embedding model's vocabulary. None if ``keyword`` itsself was found in the vocabulary.\n",
    "- `keyword_splitted_n_times`: the number of times the $n$-gram keyword was splitted to find matches in the embedding model's vocabulary.\n",
    "- `match_vocabulary_index`: the index of the keyword (segment) in the embedding model's vocabulary.\n",
    "- `match_vocabulary_word`: the word matched to the keyword (segment) in the embedding model's vocabulary.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find nearest neighbors\n",
    "\n",
    "For each keyword (segment) matched to a word in the word embedding model's vocabulary, we find the $k$ nearest neighbors in the embedding model's vocabulary.\n",
    "Below we first find the top-100 most similar words and below check how reducing $k$ from 100 to 50, 25, and 10 affects the numbers of terms that would require manual checking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f37d9b2fc6c74d38a4d4f2205d64762a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "searching nearest neighbors for matched model vocabulary terms:   0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "k = max(args.nearest_neighbors_values)\n",
    "nearest_neighbors = [\n",
    "    (c, w, v, s, r) \n",
    "    for c, w in tqdm(\n",
    "        zip(matched_df['category'], matched_df['match_vocabulary_word']),\n",
    "        desc='searching nearest neighbors for matched model vocabulary terms',\n",
    "        total=len(matched_df),\n",
    "    )\n",
    "    for r, (v, s) in enumerate(model.most_similar(w, topn=k))\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert to DataFrame\n",
    "nearest_neighbors_df = pd.DataFrame(nearest_neighbors, columns=['category', 'word', 'neighbor', 'similarity', 'rank'])\n",
    "nearest_neighbors_df['neighbor'] = nearest_neighbors_df.neighbor.str.lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute number of unique words in nearest neighbors set that would require manual review"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(args.output_folder, exist_ok=True)\n",
    "keywords_file_template = 'dhh_dictionary_expanded_keywords_k{k}.json'\n",
    "\n",
    "df_to_json = lambda x: {c: words[0].tolist() for c, words in x.groupby('category').agg({'neighbor': lambda x: x.str.replace('_', ' ')}).iterrows()}\n",
    "\n",
    "counts = dict()\n",
    "for k in args.nearest_neighbors_values:\n",
    "    # subset to k nearest neighbors\n",
    "    tmp = nearest_neighbors_df[nearest_neighbors_df['rank'] < k]\n",
    "    \n",
    "    # save keywords to file\n",
    "    fp = os.path.join(args.output_folder, keywords_file_template.format(k=k))\n",
    "    if not os.path.exists(fp) or args.overwrite_output:\n",
    "        with open(fp, 'w') as f:\n",
    "            json.dump(df_to_json(tmp), f, indent=2)\n",
    "\n",
    "    # group by `neighbor` and count the number of times each neighbor appears and its mean similarity\n",
    "    unique_neighbors_df = tmp.groupby('neighbor').agg({'word': 'count', 'similarity': 'mean'}).sort_values('similarity', ascending=False).reset_index().rename(columns={'word': 'n_times', 'similarity': 'mean_similarity', 'neighbor': 'word'})\n",
    "    \n",
    "    # flag words that already appear in the dictionary\n",
    "    kws = [val for vals in keywords.values() for val in vals]\n",
    "    unique_neighbors_df['in_dictionary'] = unique_neighbors_df.word.apply(lambda x: any([bool(re.search(re.compile(kw, flags=re.IGNORECASE), x)) for kw in kws]))\n",
    "    \n",
    "    # tabulate existing/new words\n",
    "    counts[k] = unique_neighbors_df.in_dictionary.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = pd.DataFrame(counts)\n",
    "result.index.names = ['in_dictionary']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table shows that if one would find the top-10 most similar words with the embedding model for each dictionary keyword (segment), there would be 4119 unique terms that would require manual checking (242 words in the nearest neighbors set would require no checking because they already in the set of dictionary keywords).\n",
    "The number of unique terms that would require manual checking increases to 10100 if one would increase $k$ to 25 to examine the 25 (not 10) most similar terms to each dictionary keyword (segment).\n",
    "(This is alraedy more than the number of labeled sentences we need for our fine-tuning approach.)\n",
    "At latest when one wants to examine the 50 most similar terms to each dictionary keyword (segment), the number of unique terms that would require manual checking approaches 20K."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dest = os.path.join(args.experiment_results_path, args.experiment_name)\n",
    "os.makedirs(dest, exist_ok=True)\n",
    "fp = os.path.join(dest, 'dictionary_expansion_words_to_review_estimates.csv')\n",
    "if not os.path.exists(fp) or args.overwrite_output:\n",
    "    result.to_csv(fp)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "group_mention_detection",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
